RSDK-10618: Fix data race when a module crashes during reconfiguration #4975


Merged
jmatth merged 17 commits into viamrobotics:main from RSDK-10618 on May 16, 2025

Conversation

@jmatth jmatth commented May 7, 2025

This has a matching change in viamrobotics/goutils#432 that actually fixes the race condition. Unfortunately, the way modmanager is written meant that fixing the race caused a deadlock, which this PR addresses.

@jmatth jmatth requested a review from dgottlieb May 7, 2025 16:41
// There is a circular dependency that causes a deadlock if a module dies
// while being reconfigured. Break it here by giving up on the restart if we
// cannot lock the manager.
if locked := mgr.mu.TryLock(); !locked {
Member

If I'm reading this right, if a module crashes at the same time the mod manager has the lock for an unrelated reason (perhaps adding a resource to a different module), we'll give up on restarting the module? Is there some other means that would restart it?

And I suppose the problem (based on the comment) with just using the mod.inRecoveryLock here is that module reconfiguration does not take the in-recovery lock, but would need to (for the purpose of serializing with restarts -- which it definitely needs to do from a logic perspective)?

Member Author

we'll give up on restarting the module

Correct

Is there some other means that would restart it?

Not that I'm aware of

And I suppose the problem (based on the comment) from just using the mod.inRecoveryLock here is that module reconfiguration does not take the in recovery lock...

Correct, that lock was only ever taken by the restart callback. I actually removed it (and that weird atomic bool) since taking the modmanager lock made it unnecessary.
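For illustration, the pattern in the snippet above boils down to something like this; the package, types, and logging are simplified stand-ins for sketching purposes, not the actual modmanager code:

package sketch

import (
	"log"
	"sync"
)

// Illustrative stand-ins for the real modmanager types.
type module struct{ name string }

type Manager struct {
	mu sync.Mutex
}

// attemptRestart gives up instead of deadlocking: if another goroutine
// (e.g. Reconfigure) already holds the manager lock, skip the restart.
func (mgr *Manager) attemptRestart(mod *module) bool {
	if locked := mgr.mu.TryLock(); !locked {
		log.Printf("module %s exited while the manager was busy; skipping restart", mod.name)
		return false
	}
	defer mgr.mu.Unlock()
	// ... restart the module process here ...
	return true
}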

Member Author

jmatth commented May 15, 2025

Ok, this should be good to go now. Sorry for the extra noise from moving module and its methods into their own file. The functional changes are all in modmanager:

  • When removing a module, we now always set pendingRemoval, even if it doesn't have any resources
  • In the unexpected exit handler, we
    • Take the manager lock to avoid racing with any Reconfigure calls or similar
      • This also removed the need for separate restart locks / atomics on the module itself, so I removed them
    • Check if the module is either pending removal or the process is already running, and abandon the restart attempt if so (see the sketch below)
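Roughly, the guard described in the last bullet looks like the sketch below; the pendingRemoval field and the Status() check follow this thread, but the surrounding package and types are simplified stand-ins rather than the code merged here:

package sketch

import "sync"

// Simplified stand-ins; the real definitions live in RDK's modmanager package.
type managedProcess interface {
	// Status returns nil while the underlying OS process is still running.
	Status() error
}

type module struct {
	pendingRemoval bool
	process        managedProcess
}

type Manager struct {
	mu sync.Mutex
}

// handleUnexpectedExit sketches the flow described above: take the manager
// lock, re-check assumptions, and only then restart the module process.
func (mgr *Manager) handleUnexpectedExit(mod *module) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()

	if mod.pendingRemoval {
		// The module was removed while we waited on the lock; don't revive it.
		return
	}
	if err := mod.process.Status(); err == nil {
		// A replacement process is already running (Reconfigure won the race).
		return
	}
	// ... restart the module process with its existing configuration ...
}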

@jmatth jmatth requested a review from dgottlieb May 15, 2025 14:34
// closed.
// Always mark pendingRemoval even if there is no more work to do because the
// restart handler checks it to avoid restarting a removed module.
mod.pendingRemoval = true
Member

Not your code, just scrutinizing this variable now that I know it exists. This variable is also used to "hide" modules that are in shutdown, but otherwise still up. I can't comment off-diff (obligatory: GitHub PR is not a professional CR tool). But in modManager.getModule, we have this code:

func (mgr *Manager) getModule(conf resource.Config) (foundMod *module, exists bool) {
	mgr.modules.Range(func(_ string, mod *module) bool {
		var api resource.RPCAPI
		var ok bool
		for a := range mod.handles {
			if a.API == conf.API {
				api = a
				ok = true
				break
			}
		}
		if !ok {
			return true
		}
		for _, model := range mod.handles[api] {
			if conf.Model == model && !mod.pendingRemoval {
				foundMod = mod
				exists = true
				return false
			}
		}
		return true
	})

	return
}

Does that set us up for a race between the "delayed" removal of module foo V0 and adding a new component that depends on module foo V1?

Most of the mod manager operations are under a global lock (at least the ones related to module process lifetimes). So I suppose we're good. But just asking out loud now that I've been introduced to this nuanced module state.

Member Author

That's a good point. I don't think this set of changes creates any new races, since it's just setting the pendingRemoval field on modules that are being removed from the manager anyway. That said, it looks like a lot of codepaths to getModule don't take the lock (I guess because they're just reads?), so there may be more preexisting race conditions there.
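Purely for illustration (and not something this PR changes), a call site that wanted to serialize with removals could wrap the lookup in a hypothetical helper like this, reusing getModule and resource.Config from the snippet above:

// Hypothetical (not part of this PR): a call site that serializes the lookup
// with concurrent removals by holding the manager lock around getModule.
func (mgr *Manager) findModuleForConfig(conf resource.Config) (*module, bool) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()
	return mgr.getModule(conf)
}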

Member Author

obligatory: GitHub PR is not a professional CR tool

New 20% project: set up Viam Gerrit instance?

return
}

// Something else already started a new process while we were waiting on the
Member

I don't understand this. I know you were describing how the code likes to re-use the same module object. But I think we'll want something more satisfying than "something else did something we only know how to observe via a Status call".

What's the sequence of public API calls + module crashes that leads to this?

Member Author
@jmatth jmatth May 15, 2025

There are two cases we need to account for: the module process crashes while being removed, and the module process crashes while being reconfigured. The case this check addresses is the crash during reconfigure. The modmanager will kill and restart the module process, so by the time the exit handler has the lock and can proceed, a new process may already be running, and we need to avoid starting a second one. Here's my best attempt at showing the goroutines interleaving in markdown form:

| modmanager goroutine | module.process.manage goroutine |
| --- | --- |
| manager.Reconfigure() | <call to cmd.Wait() returns w/ err> |
| <manager.mu locked> | <module.process.mu locked> |
| manager.closeModule() | |
| module.stopProcess() | |
| module.process.Stop() | |
| <block on module.process.mu> | |
| | <reaches OUE handler, module.process.mu unlocked> |
| <module.process.mu unlocked, lock it and complete call to Stop> | manager.newOnUnexpectedExitHandler |
| manager.startModule() | <block on manager lock> |
| <various further calls assign a new running managedProcess to module.process> | |
| <manager.mu unlocked> | |
| | <unblocked, lock manager.mu> |
| | <module.process.Status() == nil, abort restart> |

The first fix I tried was passing in the managedProcess pointer to the exit handler along with the exit code and then checking whether it was the same as the pointer stored in module.process (roughly the comparison sketched below). If the pointers differed, then we lost the race and a new managedProcess instance had been set up on the module, so we should abort the restart. That felt cleaner than checking the current process status, but it required changing a public API in pexec, so I replaced it with the current implementation. I could probably find/create some other unique value that changes each time the module restarts to accomplish the same thing if we don't want to check the process status. Top-of-mind are:

  • Create a new module struct each time the module is restarted and compare the module pointers.
  • Use ManagedProcess.ID() to check if the process instance has changed. The docstring on ManagedProcess.ID() says it "returns the unique ID of the process", but currently we just set it to the module name, which in this context is not unique. I could add some random or incrementing value to that string to make it unique, but I'm not sure if anything else relies on that ID to map back to module names.
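For reference, the pointer-comparison alternative would boil down to something like the hypothetical sketch below; the helper name, package, and simplified types are made up for illustration, and the real change would also require plumbing the crashed process through pexec's exit handler:

package sketch

import "sync"

// Simplified stand-ins for illustration only.
type managedProcess struct{}

type module struct {
	process *managedProcess
}

type Manager struct {
	mu sync.Mutex
}

// restartIsStale reports whether the process that crashed has already been
// replaced on the module, in which case the exit handler should abort.
func (mgr *Manager) restartIsStale(mod *module, crashed *managedProcess) bool {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()
	// If the pointers differ, Reconfigure installed a new process while the
	// exit handler was waiting, so the restart attempt lost the race.
	return mod.process != crashed
}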

Member

This all makes sense. Thanks for the detailed writeup. I'm going to leave some remarks re: documentation. But in hindsight, what you have is pretty good already.

Member

I also agree with the assessment re: checking a managed process pointer (or something more logical). That would be preferable, but not preferable enough to justify breaking the API further.

// Log error immediately, as this is unexpected behavior.
mod.logger.Errorw(
"Module has unexpectedly exited.", "module", mod.cfg.Name, "exit_code", exitCode,
)

// Take the lock to avoid racing with calls like Reconfigure that may make
Member

Suggested documentation:

Note that a module process can be restarted while preserving the same `module` object. Consider the case where a module is being restarted due to a configuration change concurrently with the existing module process crashing. There are two acceptable interleavings:
1. The `onUnexpectedExitHandler` restarts the module process with the old configuration, and `Reconfigure` then shuts down and restarts the (freshly launched) module process with one using the updated configuration.
2. `Reconfigure` executes and starts the module process[1] with the updated config. The `onUnexpectedExitHandler` will still run, but becomes a no-op.

For the second scenario, we check our assumptions after acquiring the modmanager mutex. If the module process is already running, there is nothing for us to do.

[1] I'm seeing that reconfigure has this code for stopping the existing module process:

	if err := mod.stopProcess(); err != nil {
		return errors.WithMessage(err, "error while stopping module "+mod.cfg.Name)
	}

Have we confirmed that, in our case of racing with a crash, stopProcess there will not return an error?

Member Author

In both outcomes of the race (either Stop or manage getting the process lock first), the call to managedProcess.Stop() will return an error reporting that the process it tried to kill did not exist. We whitelist that error and stop propagating it in module.stopProcess, so the module stop code will proceed normally.
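A rough sketch of that whitelisting, using an assumed sentinel error and stand-in types in place of whatever pexec actually returns and the real module definition:

package sketch

import "errors"

// errAlreadyExited stands in for whatever error pexec returns when Stop is
// called on a process that has already died; the real value may differ.
var errAlreadyExited = errors.New("process already exited")

type managedProcess interface{ Stop() error }

type module struct {
	process managedProcess
}

// stopProcess sketches the whitelisting described above: an "already dead"
// error is swallowed so module removal and reconfiguration proceed normally.
func (mod *module) stopProcess() error {
	if err := mod.process.Stop(); err != nil && !errors.Is(err, errAlreadyExited) {
		return err
	}
	return nil
}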

Member
@dgottlieb dgottlieb left a comment

Optimistically approving. Left a comment that just adds more words to the race your documentation alludes to. Feel free to correct any inaccuracies or rewrite to your tastes.

@jmatth jmatth merged commit 915c3bc into viamrobotics:main May 16, 2025
19 checks passed
@jmatth jmatth deleted the RSDK-10618 branch May 16, 2025 17:30